311 research outputs found

    Learning Temporal Alignment Uncertainty for Efficient Event Detection

    Full text link
    In this paper we tackle the problem of efficient video event detection. We argue that linear detection functions should be preferred in this regard due to their scalability and efficiency during estimation and evaluation. A popular approach in this regard is to represent a sequence using a bag of words (BOW) representation due to its: (i) fixed dimensionality irrespective of the sequence length, and (ii) its ability to compactly model the statistics in the sequence. A drawback to the BOW representation, however, is the intrinsic destruction of the temporal ordering information. In this paper we propose a new representation that leverages the uncertainty in relative temporal alignments between pairs of sequences while not destroying temporal ordering. Our representation, like BOW, is of a fixed dimensionality making it easily integrated with a linear detection function. Extensive experiments on CK+, 6DMG, and UvA-NEMO databases show significant performance improvements across both isolated and continuous event detection tasks.Comment: Appeared in DICTA 2015, 8 page

    Two Stream LSTM: A Deep Fusion Framework for Human Action Recognition

    Full text link
    In this paper we address the problem of human action recognition from video sequences. Inspired by the exemplary results obtained via automatic feature learning and deep learning approaches in computer vision, we focus our attention towards learning salient spatial features via a convolutional neural network (CNN) and then map their temporal relationship with the aid of Long-Short-Term-Memory (LSTM) networks. Our contribution in this paper is a deep fusion framework that more effectively exploits spatial features from CNNs with temporal features from LSTM models. We also extensively evaluate their strengths and weaknesses. We find that by combining both the sets of features, the fully connected features effectively act as an attention mechanism to direct the LSTM to interesting parts of the convolutional feature sequence. The significance of our fusion method is its simplicity and effectiveness compared to other state-of-the-art methods. The evaluation results demonstrate that this hierarchical multi stream fusion method has higher performance compared to single stream mapping methods allowing it to achieve high accuracy outperforming current state-of-the-art methods in three widely used databases: UCF11, UCFSports, jHMDB.Comment: Published as a conference paper at WACV 201

    Deep Decision Trees for Discriminative Dictionary Learning with Adversarial Multi-Agent Trajectories

    Full text link
    With the explosion in the availability of spatio-temporal tracking data in modern sports, there is an enormous opportunity to better analyse, learn and predict important events in adversarial group environments. In this paper, we propose a deep decision tree architecture for discriminative dictionary learning from adversarial multi-agent trajectories. We first build up a hierarchy for the tree structure by adding each layer and performing feature weight based clustering in the forward pass. We then fine tune the player role weights using back propagation. The hierarchical architecture ensures the interpretability and the integrity of the group representation. The resulting architecture is a decision tree, with leaf-nodes capturing a dictionary of multi-agent group interactions. Due to the ample volume of data available, we focus on soccer tracking data, although our approach can be used in any adversarial multi-agent domain. We present applications of proposed method for simulating soccer games as well as evaluating and quantifying team strategies.Comment: To appear in 4th International Workshop on Computer Vision in Sports (CVsports) at CVPR 201

    A Deep Four-Stream Siamese Convolutional Neural Network with Joint Verification and Identification Loss for Person Re-detection

    Full text link
    State-of-the-art person re-identification systems that employ a triplet based deep network suffer from a poor generalization capability. In this paper, we propose a four stream Siamese deep convolutional neural network for person redetection that jointly optimises verification and identification losses over a four image input group. Specifically, the proposed method overcomes the weakness of the typical triplet formulation by using groups of four images featuring two matched (i.e. the same identity) and two mismatched images. This allows us to jointly increase the interclass variations and reduce the intra-class variations in the learned feature space. The proposed approach also optimises over both the identification and verification losses, further minimising intra-class variation and maximising inter-class variation, improving overall performance. Extensive experiments on four challenging datasets, VIPeR, CUHK01, CUHK03 and PRID2011, demonstrates that the proposed approach achieves state-of-the-art performance.Comment: Published in WACV 201

    Improved Reinforcement Learning with Curriculum

    Full text link
    Humans tend to learn complex abstract concepts faster if examples are presented in a structured manner. For instance, when learning how to play a board game, usually one of the first concepts learned is how the game ends, i.e. the actions that lead to a terminal state (win, lose or draw). The advantage of learning end-games first is that once the actions which lead to a terminal state are understood, it becomes possible to incrementally learn the consequences of actions that are further away from a terminal state - we call this an end-game-first curriculum. Currently the state-of-the-art machine learning player for general board games, AlphaZero by Google DeepMind, does not employ a structured training curriculum; instead learning from the entire game at all times. By employing an end-game-first training curriculum to train an AlphaZero inspired player, we empirically show that the rate of learning of an artificial player can be improved during the early stages of training when compared to a player not using a training curriculum.Comment: Draft prior to submission to IEEE Trans on Games. Changed paper slightl

    Tracking by Prediction: A Deep Generative Model for Mutli-Person localisation and Tracking

    Full text link
    Current multi-person localisation and tracking systems have an over reliance on the use of appearance models for target re-identification and almost no approaches employ a complete deep learning solution for both objectives. We present a novel, complete deep learning framework for multi-person localisation and tracking. In this context we first introduce a light weight sequential Generative Adversarial Network architecture for person localisation, which overcomes issues related to occlusions and noisy detections, typically found in a multi person environment. In the proposed tracking framework we build upon recent advances in pedestrian trajectory prediction approaches and propose a novel data association scheme based on predicted trajectories. This removes the need for computationally expensive person re-identification systems based on appearance features and generates human like trajectories with minimal fragmentation. The proposed method is evaluated on multiple public benchmarks including both static and dynamic cameras and is capable of generating outstanding performance, especially among other recently proposed deep neural network based approaches.Comment: To appear in IEEE Winter Conference on Applications of Computer Vision (WACV), 201

    SAIVT-QUT@TRECVid 2012: Interactive surveillance event detection

    Get PDF
    In this paper, we propose an approach which attempts to solve the problem of surveillance event detection, assuming that we know the definition of the events. To facilitate the discussion, we first define two concepts. The event of interest refers to the event that the user requests the system to detect; and the background activities are any other events in the video corpus. This is an unsolved problem due to many factors as listed below: 1) Occlusions and clustering: The surveillance scenes which are of significant interest at locations such as airports, railway stations, shopping centers are often crowded, where occlusions and clustering of people are frequently encountered. This significantly affects the feature extraction step, and for instance, trajectories generated by object tracking algorithms are usually not robust under such a situation. 2) The requirement for real time detection: The system should process the video fast enough in both of the feature extraction and the detection step to facilitate real time operation. 3) Massive size of the training data set: Suppose there is an event that lasts for 1 minute in a video with a frame rate of 25fps, the number of frames for this events is 60X25 = 1500. If we want to have a training data set with many positive instances of the event, the video is likely to be very large in size (i.e. hundreds of thousands of frames or more). How to handle such a large data set is a problem frequently encountered in this application. 4) Difficulty in separating the event of interest from background activities: The events of interest often co-exist with a set of background activities. Temporal groundtruth typically very ambiguous, as it does not distinguish the event of interest from a wide range of co-existing background activities. However, it is not practical to annotate the locations of the events in large amounts of video data. This problem becomes more serious in the detection of multi-agent interactions, since the location of these events can often not be constrained to within a bounding box. 5) Challenges in determining the temporal boundaries of the events: An event can occur at any arbitrary time with an arbitrary duration. The temporal segmentation of events is difficult and ambiguous, and also affected by other factors such as occlusions

    Hierarchical Attention Network for Action Segmentation

    Full text link
    The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in the video. Several attempts have been made to capture frame-level salient aspects through attention but they lack the capacity to effectively map the temporal relationships in between the frames as they only capture a limited span of temporal dependencies. To this end we propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time, thus improving the overall segmentation performance. The proposed hierarchical recurrent attention framework analyses the input video at multiple temporal scales, to form embeddings at frame level and segment level, and perform fine-grained action segmentation. This generates a simple, lightweight, yet extremely effective architecture for segmenting continuous video streams and has multiple application domains. We evaluate our system on multiple challenging public benchmark datasets, including MERL Shopping, 50 salads, and Georgia Tech Egocentric datasets, and achieves state-of-the-art performance. The evaluated datasets encompass numerous video capture settings which are inclusive of static overhead camera views and dynamic, ego-centric head-mounted camera views, demonstrating the direct applicability of the proposed framework in a variety of settings.Comment: Published in Pattern Recognition Letter

    Feature-domain super-resolution framework for Gabor-based face and iris recognition

    Get PDF
    The low resolution of images has been one of the major limitations in recognising humans from a distance using their biometric traits, such as face and iris. Superresolution has been employed to improve the resolution and the recognition performance simultaneously, however the majority of techniques employed operate in the pixel domain, such that the biometric feature vectors are extracted from a super-resolved input image. Feature-domain superresolution has been proposed for face and iris, and is shown to further improve recognition performance by capitalising on direct super-resolving the features which are used for recognition. However, current feature-domain superresolution approaches are limited to simple linear features such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), which are not the most discriminant features for biometrics. Gabor-based features have been shown to be one of the most discriminant features for biometrics including face and iris. This paper proposes a framework to conduct super-resolution in the non-linear Gabor feature domain to further improve the recognition performance of biometric systems. Experiments have confirmed the validity of the proposed approach, demonstrating superior performance to existing linear approaches for both face and iris biometrics
    corecore